讀取kaggle_titanic_train.csv與kaggle_titanic_test.csv。觀察資料結構,發現Age、Fare與Embarked含有遺漏值。
train <- read.csv('kaggle_titanic_train.csv')
test <- read.csv('kaggle_titanic_test.csv')summary(train)## PassengerId Survived Pclass
## Min. : 1.0 Min. :0.0000 Min. :1.000
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000
## Median :446.0 Median :0.0000 Median :3.000
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Name Sex Age
## Abbing, Mr. Anthony : 1 female:314 Min. : 0.42
## Abbott, Mr. Rossmore Edward : 1 male :577 1st Qu.:20.12
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 Median :28.00
## Abelson, Mr. Samuel : 1 Mean :29.70
## Abelson, Mrs. Samuel (Hannah Wizosky): 1 3rd Qu.:38.00
## Adahl, Mr. Mauritz Nils Martin : 1 Max. :80.00
## (Other) :885 NA's :177
## SibSp Parch Ticket Fare
## Min. :0.000 Min. :0.0000 1601 : 7 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:0.0000 347082 : 7 1st Qu.: 7.91
## Median :0.000 Median :0.0000 CA. 2343: 7 Median : 14.45
## Mean :0.523 Mean :0.3816 3101295 : 6 Mean : 32.20
## 3rd Qu.:1.000 3rd Qu.:0.0000 347088 : 6 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 CA 2144 : 6 Max. :512.33
## (Other) :852
## Cabin Embarked
## :687 : 2
## B96 B98 : 4 C:168
## C23 C25 C27: 4 Q: 77
## G6 : 4 S:644
## C22 C26 : 3
## D : 3
## (Other) :186
summary(test)## PassengerId Pclass
## Min. : 892.0 Min. :1.000
## 1st Qu.: 996.2 1st Qu.:1.000
## Median :1100.5 Median :3.000
## Mean :1100.5 Mean :2.266
## 3rd Qu.:1204.8 3rd Qu.:3.000
## Max. :1309.0 Max. :3.000
##
## Name Sex
## Abbott, Master. Eugene Joseph : 1 female:152
## Abelseth, Miss. Karen Marie : 1 male :266
## Abelseth, Mr. Olaus Jorgensen : 1
## Abrahamsson, Mr. Abraham August Johannes : 1
## Abrahim, Mrs. Joseph (Sophie Halaut Easu): 1
## Aks, Master. Philip Frank : 1
## (Other) :412
## Age SibSp Parch Ticket
## Min. : 0.17 Min. :0.0000 Min. :0.0000 PC 17608: 5
## 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.0000 113503 : 4
## Median :27.00 Median :0.0000 Median :0.0000 CA. 2343: 4
## Mean :30.27 Mean :0.4474 Mean :0.3923 16966 : 3
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.0000 220845 : 3
## Max. :76.00 Max. :8.0000 Max. :9.0000 347077 : 3
## NA's :86 (Other) :396
## Fare Cabin Embarked
## Min. : 0.000 :327 C:102
## 1st Qu.: 7.896 B57 B59 B63 B66: 3 Q: 46
## Median : 14.454 A34 : 2 S:270
## Mean : 35.627 B45 : 2
## 3rd Qu.: 31.500 C101 : 2
## Max. :512.329 C116 : 2
## NA's :1 (Other) : 80
將兩個資料合併成一個名為titanic的dataframe,以利後續分析運用。
titanic <- bind_rows(train, test)觀察登船港口遺漏資料,發現Fare都是80元,Passenger Class是1。
embark_missing <- titanic %>%
filter(Embarked == "")
head(embark_missing)## PassengerId Survived Pclass Name
## 1 62 1 1 Icard, Miss. Amelie
## 2 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn)
## Sex Age SibSp Parch Ticket Fare Cabin Embarked
## 1 female 38 0 0 113572 80 B28
## 2 female 62 0 0 113572 80 B28
過濾掉Embarked遺漏值,以登船港口和乘客等級分別繪製票價盒鬚圖
embark_fare <- titanic %>%
filter(Embarked != "")
ggplot(embark_fare, aes(x = Embarked, y = Fare, fill = factor(Pclass))) +
geom_boxplot() +
scale_fill_brewer(palette = 7 , type = "div") +
ggtitle("登船港口和乘客等級之票價盒鬚圖")發現票價80元與C港口高階乘客票價中位數差不多,故將Embarked兩個遺漏值填補為C。
titanic$Embarked[titanic$Embarked == ""] <- "C"挑出遺漏Fare的該筆資料來觀看。
fare_missing <- subset(titanic,is.na(titanic$Fare))## PassengerId Survived Pclass Name Sex Age SibSp Parch
## 1044 1044 NA 3 Storey, Mr. Thomas male 60.5 0 0
## Ticket Fare Cabin Embarked
## 1044 3701 NA S
Pclass == ‘3’,Embarked == ‘s’,觀察符合這些條件的票價情況。
third_s <- titanic[titanic$Pclass == '3' & titanic$Embarked == 'S', ]## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 7.854 8.050 14.435 15.900 69.550 1
這些條件下Fare最多落在8左右,與其中位數差不多,故以此填補Fare遺漏值。
titanic$Fare[is.na(titanic$Fare)] <- median(third_s$Fare, na.rm = TRUE)首先排除不太相關的變數,保留有用的變數,也能避免程式跑太久。
mice_data <- titanic[, !names(titanic) %in% c('PassengerId','Name','Ticket','Cabin','Survived')]使用mice套件處理Age的遺漏值,method使用Random Forest。 再將填補完成的結果輸出,並將原本的Age欄位替換成填補好的資料。
mice_result <- mice(mice_data, method='rf',seed = 9487)
mice_output <- complete(mice_result)
titanic$Age <- mice_output$Age運用填補好各個遺漏值的資料來進行視覺化圖表繪製與觀察。首先要將填補好遺漏值的titanic資料還原成train與test資料。以具備Survived值的train data進行圖表繪製。
train <- titanic[1:891,]以下幾點觀察發現:
分別使用Random Forest及Classification And Regression Tree來建立預測模型。
train <- titanic[1:891,]
test <- titanic[892:1309,]
set.seed(9487)
rf_fit <- randomForest(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked , data = train, ntree = 200)
cart_fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data = train, method = "class")進行測試資料預測。
rf_prediction <- predict(rf_fit, newdata = test[, c("Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked")])
cart_prediction <- predict(cart_fit, newdata = test[, c("Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked")],type = "class")輸出符合上傳格式的csv。
to_submit_rf <- data.frame(test[, "PassengerId"], rf_prediction)
names(to_submit_rf) <- c("PassengerId", "Survived")
write.csv(to_submit_rf,file="to_be_submitted_rf.csv",row.names = FALSE)to_submit_cart <- data.frame(test[, "PassengerId"], cart_prediction)
names(to_submit_cart) <- c("PassengerId", "Survived")
write.csv(to_submit_cart,file="to_be_submitted_cart.csv",row.names = FALSE)使用隨機森林預測模型的結果分數為0.76555。
使用決策樹預測模型的結果分數為0.79426。